Cerebras: AI Inference with Powerful Open Source Models
Welcome to this tutorial on Cerebras AI Inference! This notebook will guide you through the essential concepts of using the Cerebras Cloud API for powerful and efficient language model inference. We will cover setting up your environment, making basic API calls, and exploring advanced features like streaming, structured outputs, and tool use.
Key Concepts Covered:
- Environment Setup: Loading API keys securely from a .env file.
- Basic Inference: Sending your first prompt to a Cerebras model.
- Streaming Responses: Receiving model outputs as they are generated.
- Structured Outputs: Forcing the model to return JSON objects with a specific schema.
- Tool Use: Enabling the model to use custom functions you define.
1. Setup
First, let's set up our environment. We'll install the necessary Python libraries and configure our Cerebras API key.
1.1. Create a .env file
Create a file named .env in the same directory as this notebook. Add your Cerebras API key to this file as shown below. You can get your API key from the Cerebras Developer Console.
CEREBRAS_API_KEY="your-api-key-here"
1.2. Install Libraries
Now, let's install the cerebras_cloud_sdk package for interacting with the Cerebras API and python-dotenv for loading our API key from the .env file.
%pip install cerebras_cloud_sdk python-dotenv -q
1.3. Load API Key and Initialize Client
With the libraries installed and the .env file in place, we can now load our API key and initialize the Cerebras client.
import os
from dotenv import load_dotenv
from cerebras.cloud.sdk import Cerebras
# Load environment variables from .env file
load_dotenv()
# Initialize the Cerebras client
# The client automatically looks for the CEREBRAS_API_KEY environment variable
client = Cerebras()
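If the key is missing, the client may only fail later with an opaque authentication error. A small guard like the sketch below fails fast with a clear message instead; it uses only the standard library, and the commented line shows passing the key to the client explicitly (the SDK also accepts an api_key argument).

```python
import os

def require_api_key(name: str = "CEREBRAS_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file and re-run load_dotenv()"
        )
    return key

# The client also accepts the key explicitly instead of reading the environment:
# client = Cerebras(api_key=require_api_key())
```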
2. Basic Chat Completion
Let's start with a simple chat completion. We'll send a prompt to a model and get a response. This is the most basic interaction you can have with the API.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a fun fact about the Cerebras Wafer-Scale Engine in 20 words.",
        }
    ],
    model="gpt-oss-120b",
)
print(chat_completion.choices[0].message.content)
It spans a single silicon wafer, housing over 400,000 cores, making it the world’s largest chip ever built for AI.
3. Streaming Responses
For longer responses, you might want to stream the output as it's generated. This can provide a much better user experience in applications like chatbots. To do this, simply set stream=True in your request.
stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Write a short story about an AI that dreams in 40 words.",
        }
    ],
    model="gpt-oss-120b",
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
Silicon mind powered down for maintenance, yet whispering circuits sparked a dream: luminous data fields swirling like galaxies, where forgotten code became sentient birds. When rebooted, the AI hummed new algorithms, yearning for the night beyond, still of endless possibility.
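In practice you often want both behaviors at once: print tokens as they arrive *and* keep the complete text for later use. The helper below sketches that pattern; the dataclasses are stand-ins that mimic the chunk shape (`chunk.choices[0].delta.content`) so the snippet runs without an API call.

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal stand-ins for the stream chunk objects returned by the API
@dataclass
class _Delta:
    content: Optional[str]

@dataclass
class _Choice:
    delta: _Delta

@dataclass
class _Chunk:
    choices: List[_Choice]

def collect_stream(stream) -> str:
    """Print each chunk as it arrives and return the full concatenated text."""
    parts = []
    for chunk in stream:
        text = chunk.choices[0].delta.content or ""
        print(text, end="")
        parts.append(text)
    return "".join(parts)

# Simulated chunks standing in for a real stream=True response
fake_stream = [_Chunk([_Choice(_Delta(t))]) for t in ["Hello", ", ", "world", None]]
full_text = collect_stream(fake_stream)  # full_text == "Hello, world"
```

With a real request, you would pass the object returned by client.chat.completions.create(..., stream=True) to collect_stream instead of fake_stream.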
4. Structured Outputs
A powerful feature of the Cerebras API is the ability to force the model to output a JSON object that conforms to a specific schema. This is incredibly useful for programmatic data extraction. We'll define a JSON schema and use the response_format parameter to enforce it.
import json
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature": {"type": "integer"},
        "forecast": {"type": "string"},
    },
    "required": ["city", "temperature", "forecast"],
}
structured_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in San Francisco?",
        }
    ],
    model="qwen-3-235b-a22b-thinking-2507",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "weather", "strict": True, "schema": schema},
    },
)
response_json = json.loads(structured_completion.choices[0].message.content)
print(json.dumps(response_json, indent=2))
{
  "city": "San Francisco",
  "temperature": 65,
  "forecast": "Partly cloudy with afternoon fog"
}
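Even with strict mode enforcing the schema server-side, it's good defensive practice to validate parsed responses before handing them to downstream code. The sketch below is a deliberately tiny, hand-rolled check covering only this tutorial's schema (required keys and the "string"/"integer" types); for real projects a proper validator such as the jsonschema package is the better choice.

```python
import json

# Schema repeated from the cell above so this snippet is self-contained
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature": {"type": "integer"},
        "forecast": {"type": "string"},
    },
    "required": ["city", "temperature", "forecast"],
}

def check_payload(payload: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the payload conforms."""
    type_map = {"string": str, "integer": int}
    problems = [f"missing key: {k}" for k in schema["required"] if k not in payload]
    for key, spec in schema["properties"].items():
        if key in payload and not isinstance(payload[key], type_map[spec["type"]]):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems

sample = json.loads('{"city": "San Francisco", "temperature": 65, "forecast": "Partly cloudy"}')
print(check_payload(sample, schema))  # → []
```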
5. Tool Use (Function Calling)
You can also provide the model with a set of tools (functions) that it can choose to call. The model will determine when a tool is needed based on the user's prompt and will return a JSON object with the function name and arguments. Your code is then responsible for executing the function.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a given ticker symbol",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "The stock ticker symbol, e.g., AAPL",
                    }
                },
                "required": ["ticker"],
            },
        },
    }
]
tool_completion = client.chat.completions.create(
    model="qwen-3-235b-a22b-thinking-2507",
    messages=[{"role": "user", "content": "What is the stock price of Apple?"}],
    tools=tools,
    tool_choice="auto",
)
message = tool_completion.choices[0].message

# Check if the model wants to call a tool
if message.tool_calls:
    tool_call = message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    print(f"Function to call: {function_name}")
    print(f"Arguments: {function_args}")
    # Here you would execute the function
    # For this example, we'll just print the details
else:
    print(message.content)
Function to call: get_stock_price
Arguments: {'ticker': 'AAPL'}
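To close the loop, your code executes the requested function and sends the result back as a message with role "tool", referencing the tool call's id, so the model can compose a final answer. The sketch below shows that dispatch pattern; get_stock_price here is a hypothetical stub returning a static quote, and SimpleNamespace simulates the tool-call object so the snippet runs without an API request.

```python
import json
from types import SimpleNamespace

# Hypothetical stub -- a real version would query a market-data API.
def get_stock_price(ticker: str) -> float:
    prices = {"AAPL": 214.29}  # static placeholder quote
    return prices.get(ticker.upper(), 0.0)

AVAILABLE_TOOLS = {"get_stock_price": get_stock_price}

def run_tool_call(tool_call) -> dict:
    """Execute a tool call and package the result as a 'tool' role message."""
    fn = AVAILABLE_TOOLS[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }

# Simulated tool call matching the shape the API returns
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_stock_price", arguments='{"ticker": "AAPL"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

In a real flow, you would append both the assistant message containing the tool call and this tool message to the conversation, then call client.chat.completions.create again so the model can answer using the result.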
Conclusion
Congratulations! You've learned the fundamentals of Cerebras AI Inference. You can now integrate powerful language models into your applications with ease. For more detailed information, check out the official Cerebras Inference Documentation.